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Model selection and sparse recovery are two important problems 
for which many regularization methods have been proposed. We study 
the properties of regularization methods in both problems under the 
unified framework of regularized least squares with concave penalties. 
For model selection, we establish conditions under which a regular- 
ized least squares estimator enjoys a nonasymptotic property, called 
the weak oracle property, where the dimensionality can grow expo- 
nentially with sample size. For sparse recovery, we present a sufficient 
condition that ensures the recoverability of the sparsest solution. In 
particular, we approach both problems by considering a family of 
penalties that give a smooth homotopy between Lo and Li penalties. 
We also propose the sequentially and iteratively reweighted squares 
(SIRS) algorithm for sparse recovery. Numerical studies support our 
theoretical results and demonstrate the advantage of our new meth- 
ods for model selection and sparse recovery. 

1. Introduction. Model selection and sparse recovery are two important 
areas that have attracted much attention of the researchers. They are differ- 
ent but related, and share some common ideas especially when dealing with 
large scale problems. Examples include the lasso [Tibshirani (1996)], SCAD 
[Fan (1997) and Fan and Li (2001)], Dantzig selector [Candes and Tao (2007)] 
and MCP [Zhang (2007)] in model selection, and the basis pursuit [Chen, 
Donoho and Saunders (1999)] and many other Li methods [Candes and Tao 
(2005, 2006) and Candes, Wakin and Boyd (2008)] in sparse recovery. The 
analysis of vast data sets with the number of variables p comparable to or 
much larger than the number of observations n frequently arises nowadays 
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in both areas and poses many challenges that are not present in smaller scale 
studies. Sparsity plays an important role in these large scale problems. It is 
often believed that only a small fraction of the data is informative, whereas 
most of it is noise. 

Consider the linear regression model 

(1) y = X/3 + e, 

where y is an n-dimensional vector of responses, X = (xi, . . . ,Xp) is an n x p 
design matrix, /3 = . . . ,/3p)"^ is an unknown p-dimensional vector of re- 
gression coefficients and e is an n-dimensional vector of noises. In the sparse 
modeling, we assume that a fraction of the true regression coefficients vector 
/^o = (/?o,i) • • • >/3o,p)"^ are exactly zero. In this paper we allow /9q to depend 
on n. We denote by OJIq = supp(/3q) the support of /3q, which is called the 
true underlying sparse model hereafter. Model selection aims to locate those 
predictors Xj with nonzero /3o,j, which are called true variables hereafter, 
and to give consistent estimate of (3^ on its support. Throughout the paper 
we consider deterministic design matrix X and assume the identifiability of 
/3o, in the sense that the equation X/3q = X/3, f3 G entails either (3 = (3q 
or 11/3 II > ||/9ollo- We denote by || • ||g the Lq norm on the Euclidean spaces 
q G [0, oo] . Many methods have been proposed in the literature to construct 
estimators that mimic the oracle estimators under different losses, where 
the oracle knew the true model OKq ahead of time. The main difficulty of 
recovering '>Mq lies in the coUinearity among the predictors, which increases 
as the dimensionality grows. See, for example, Fan and Li (2006) for a com- 
prehensive overview of challenges of high dimensionality in statistics. 

Much insight into model selection can be obtained if we understand a 
closely related problem of sparse recovery, which aims at finding the mini- 
mum Lq (sparsest possible) solution to the linear equation 

(2) y = X/3, 

where (3 = (/3i, . . . ,f3p)^ , y = X/3q and X and /3q are the same as those in 
the linear model (1). The identifiability of [3q assumed above ensures that 
our target solution here is unique and exactly [3q. When the p x p matrix 
X-^X is singular or close to singular, finding (3q is not an easy task. It is 
known that directly solving the Lo-regularization problem is combinatorial 
and, thus, is impractical in high dimensions. To attenuate this difficulty, 
many regularization methods such as the Li method of basis pursuit have 
been proposed to recover (3q, where continuous penalty functions are used 
in place of the Lq penalty. This raises a natural question: under what con- 
ditions does a regularization method give the same solution as that of the 
Lq regularization? Many authors have contributed to identifying conditions 
that ensure the Li/Lq equivalence. In this paper we generalize a sufficient 
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condition identified for the Li penalty to concave penalties, which ensures 
such an equivalence. 

In view of (1) and (2), model selection and sparse recovery can be re- 
garded as two interrelated problems. Due to the presence of noise, recovering 
the true model SUTq in (1) is intrinsically more challenging than recovering 
the sparsest possible solution /3q in (2). Various regularization methods in 
model selection such as those mentioned before have been studied by many 
researchers. See, for example, Bickel and Li (2006) for a comprehensive re- 
view of regularization methods in statistics. 

In a seminal paper, Fan and Li (2001) lay down the theoretical foundation 
of nonconvex penalized least squares and nonconcave penalized likelihood 
for variable selection, and introduce the concept of model selection oracle 
property. An estimator (3 is said to have the oracle property [Fan and Li 
(2001)] if: (1) it enjoys the sparsity in the sense of /35rjjc = with probability 
tending to 1 as n — > oo, and (2) it attains an information bound mimicking 
that of the oracle estimator [see also Donoho and Johnstone (1994)], where 
/3(jjtc is a subvector of /3 formed by components with indices in TIq, the com- 
plement of the true model 5Jlo- Fan and Li (2001) study the oracle proper- 
ties of nonconcave penalized likelihood estimators in the finite-dimensional 
setting. Their results were extended later by Fan and Peng (2004) to the 
setting of p = o(n^/^) or o(n^/^). In this paper we generalize the results of 
Fan and Li (2001) and Fan and Peng (2004), in the setting of regularized 
least squares, to a more general triple {s,n,p) for concave penalties, where 
s = WPqWq. In particular, we show that under some regularity conditions, 
the regularized least squares estimator enjoys a nonasymptotic weak oracle 
property, where the dimensionality p can be of exponential order in sample 
size n. This constitutes one of the main contributions of the paper. 

In this paper, we consider both problems of model selection and sparse 
recovery in the unified framework of regularized least squares with concave 
penalties. Specifically, for sparse recovery we construct the solutions of reg- 
ularization problems under the constraint in (2) by analyzing the solutions 
of related regularized least squares problems and then letting the regular- 
ization parameter A — > 0-|-. In particular, we consider a family of penalty 
functions that give a smooth homotopy between Lq and Li penalties for 
both problems. The unified approach using the Li penalty has been con- 
sidered by Chen, Donoho and Saunders (1999), Fuchs (2004), Donoho, Elad 
and Temlyakov (2006) and Tropp (2006), among others. 

The rest of the paper is organized as follows. In Section 2, we discuss the 
choice of penalty functions. We study the properties of regularization meth- 
ods in model selection and sparse recovery for concave penalties in Sections 
3 and 4. Section 5 discusses algorithms for solving regularization problems. 
In Section 6 we present four numerical examples using both simulated and 
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real data sets. Proofs are presented in Section 7. We provide some discussion 
of our results and their implications in Section 8. 

2. Regularization methods with concave penalties. In this paper, we 
study regularization methods in model selection and sparse recovery for con- 
cave penalties. For sparse recovery in (2), we consider the p-regularization 
problem 

p 

(3) min^ subject to y = X/3, 

i=i 

where p{-) is a penalty function and (3 = (/3i, . . . ,Pp)'^. For model selection 
in (1), we consider the regularized least squares problem 

(4) Si^^'lly - ^^"2 + ^nEPA„(|/3,|) 

I j=l 

where A„ € (0,cx)) is a scale parameter, Pa„(') is a penalty function, £ 
[0,oo) is a regularization parameter indexed by sample size n and f3 = 
(/?!, . . . , Pp)'^ ■ We will drop the subscript n when it causes no confusion. For 
any penalty function px, let p{t; A) = X^^p\{t) for t G [0, oo) and A G (0, oo). 
For simplicity, we will slightly abuse the notation and write p{t; A) as p{t) 
when there is no confusion. 

2.1. Penalty functions. By the nature of sparse recovery, the Lq penalty 
p{t) = I{t 7^ 0) is the target penalty in (3), whereas other penalties may also 
be capable of recovering (Bq. As mentioned before, the Lq penalty is not 
appealing from the computational point of view due to its discontinuity. It 
is known that the L2 penalty p{t) = in (3) or (4) is analytically tractable, 
but generally produces nonsparse solutions. Such concerns have motivated 
the use of penalties that are computationally tractable approximations or 
relaxations to the Lq penalty. Among all proposals the Li penalty p{t) = t, 
t G [0,00), has attracted much attention of the researchers in both sparse 
recovery and model selection. It has been recognized that the Li penalty is 
not an oracle that always points us to the true underlying sparse model. 

Hereafter, we consider penalty functions p{-) that satisfy the following 
condition. 

Condition 1. p(t) is increasing and concave in t G [0,oo), and has a 
continuous derivative p'{t) with p'{0+) G (0,oo). If p{t) is dependent on A, 
p'{t;X) is increasing in A G (0, 00) and p'{0+) is independent of A. 
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Fan and Li (2001) advocate penalty functions that give estimators with 
three desired properties — unbiasedness, sparsity and continuity — and pro- 
vide insights into them [see also Antoniadis and Fan (2001)]. We discuss the 
connection of Condition 1 with these properties. Consider problem (4) with 
n X 1 orthonormal design matrix and A„ = 1, 

(5) mm{2-\z-ef+px{m, 

where z = X'^y and px{t) = Xp{t), t G [0, oo). We denote by 6{z) the mini- 
mizer of problem (5). Fan and Li (2001) demonstrate that for the resulting 
estimator 6{z): (1) unbiasedness requires that the derivative p'xit) is close to 
zero when t S [0,(X)) is large, (2) sparsity requires p\{0+) > and (3) conti- 
nuity with respect to data z requires that the function t + p'-^{t), t G [0, oo) 
attains its minimum at f = 0. Note that the concavity of p in Condition 1 
entails that p' {t) is decreasing in t G [0, oo). Thus penalties satisfying Con- 
dition 1 and limj^oo p' {t) = enjoy the unbiasedness and sparsity. However, 
the continuity does not generally hold for all penalties in this class. The 
SCAD penalty [Fan and Li (2001)] px{t), t G [0, oo), is given by 

(6) p\{t) = xilit < A) + ~!h Ht > A)l for some a > 2, 

L (a-l)A J 

where often a = 3.7 is used, and MCP [Zhang (2007)] px{t), t G [0,oo), is 
given by p'^{t) = (oA - t)+/a. Both SCAD and MCP with a > 1 satisfy 
Condition 1 and the above three properties simultaneously. Although the 
Li penalty satisfies Condition 1 as well as sparsity and continuity, it does 
not enjoy the unbiasedness, since its derivative is identically one regardless 
of t G [0,oo). 

For a penalty function p, we define its maximum concavity as 

. ^ p'{t2)-p'{tl) 

(7) k{p) = sup 



tl,t2G{0,oo),ti<t2 h — tl 

and we define the local concavity of the penalty p at b = (6i, . . . , 6^)-^ G R'^ 
with ||b||o = q as 

f^^ ( u\ r p'{t2)-p'{ti) 
(8j bj = hm max sup . 

^^0+l^j'^9ti,t2e{|6j|-e,|fej|+e),ti<i2 *2 " 

By the concavity of p in Condition 1, we have < K{p;h) < k{p). It is 
easy to show by the mean-value theorem that k{p) defined in (7) equals 
supjg(o^oo) ~P"{^) ^-iid K,{p;h) defined in (8) equals maxi<j<q — p"(|6j|), pro- 
vided that p has a continuous second derivative p"{t). 
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Fig. 1. Plot of penalty functions po (Lq, thick solid), po.2 (dash-dot), pi (dashed), ps 
(dotted), and p^o {Li, thin solid). 



2.2. A family of penalties. Nikolova (2000) studies the transformed Li 
penalty function p{t) = bt/ (l + bt), t £ [0, oo) and 6 > 0. A slight modification 
of it gives a family of penalties {pa - a £ [0, oo]} given by, for a £ (0, oo), 

(9) p^it) = i^±i^ = ( -±-)l{t / 0) + (^)t, t e [0,oo), 



a + t \a + t J \a + t 

and 

(10) po{t)= lim pa{t) = I (t^O) and poo(t) = lim pa(t) = * 

for t G [0, oo). Figure 1 depicts pa penalties for a few a's. We see from (9) 
that this modified family has the interpretation of a smooth homotopy be- 
tween Lq and Li penalties. So we refer to them as the smooth integration 
of counting and absolute deviation (SICA) penalties. It is easy to show that 
Pa penalty with a G (0, oo] satisfies Condition 1, and for each o G (0,oo), 
liiRt^oo p'ait) = 0. Thus Pa with a G (0, oo) gives estimators satisfying the 
unbiasedness and sparsity. As mentioned before, the continuity requires 
that the function t + Xp'^{t), t G [0,oo), attains its minimum at t = 0; that 
is, a G [ao,oo], where oq = A + VA^ + 2A. Therefore, pa penalties pa with 
aG [ao,oo) satisfy Condition 1 and the above three properties simultane- 
ously, and share the same spirit as SCAD and MCP. In addition, pa is 
infinitely differentiable on [0, oo) for each a G (0, oo]. The idea of linearly 
combining Lq and Li penalties was investigated by Liu and Wu (2007). 
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For each a £ (0,oo), the pa penahy is closely related to the log penalty 
Pi,a{i) = (« + !) log(l + a-H), t e [0, oo). 

In fact, the Li penalty is the first-order approximation to both a{a + l)~^ pa{t) 
and a{a + and always dominates them. Also, we have Pa{t) = 

tp^ait). Clearly, ' 

p'ait) = ^aVtP ' *e(0,oo) for ae(0,oo), 

Pa{0+) = l + a foras(0,cx)) and p^{t) = l. 

It is easy to see that the maximum concavity of pa penalty is 

// , s 2a(a +1) ,1 9, 

(12) K{pa)= sup -p'^{t)= sup \ ' =2(a-^+a-^), 

which is the maximum curvature of the curve pa- Clearly, K{pa) is decreasing 
in a, lima-»o+ t^iPa) = oo, and lima-»oo ^{pa) = = k{poo). Therefore, param- 
eter a controls the maximum concavity of pa and regulates where it stands 
between Lq and Li penalties. 

3. Sparse recovery. In this section we consider the p-regularization prob- 
lem (3) for sparse recovery in (2). It is known that when the nxp matrix X 
is of full column rank p, (2) has a unique solution (3 = (X^X)-^X^y. Other- 
wise, it has an infinite number of solutions, all of which form a (/-dimensional 
linear subspace 

(13) ^ = {/3GRP:y = X/3} 

of with q=p — rank(X). Of interest is the nontrivial case of g > 0. 

3.1. Identifiability of (3q. As mentioned in the Introduction, the mini- 
mum Lq solution to (2) is 

(14) /3o = argmin||/3||o. 

Donoho and Elad (2003) introduce the concept of spark and show that 
the uniqueness of /3q can be characterized by the spark(X) of X, where 
r = spark(X) is defined as the smallest possible number such that there 
exists a subgroup of r columns from the nxp matrix X that are linearly 
dependent. The spark of a matrix can be very different from its rank. For 
instance, the n x [n + 1) matrix [/nGi] is of full rank n and yet has spark 
equal to 2, where ei = (1, 0, . . . , 0)"^. In particular, they proved that any 
£ A with 11/3 llo < spark(X)/2 meets /3q. Thus /3q is unique as long as 
||/3ol|o<spark(X)/2. 
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3.2. L2 penalty. When the p penalty is taken to be the L2 penalty in 
the p-regularization problem (3), its minimizer is given by 

(15) /32 = argmin||/3||2. 

It admits a closed- form solution. Viewing in the linear regression setting, we 
know that [see Theorem 6.2.1 in Fang and Zhang (1990)] the least squares 
estimate (X-^X)"'"X^y is a solution to the normal equation X-^y = X-'^X/J, 
where (•)"'" denotes the Moore-Penrose generalized matrix inverse. The fol- 
lowing proposition shows that it coincides with 

Proposition 1 (Minimum L2 solution). f32 = (X^X)+X^y. 

However, the minimum L2 solution (32 to (2) is generally nonsparse and, 
thus, is different from the minimum Lq solution /3q. 

3.3. Penalties satisfying Condition 1. We are curious about the p/Lq 
equivalence, in the sense that the minimizer of the p-regularization problem 
(3) meets the minimum Lq solution /3q. As mentioned in the Introduction, 
many researchers have contributed to identifying conditions that ensure the 
Li/Lo equivalence when p is taken to be the Li penalty. We consider penal- 
ties p satisfying Condition 1. It is generally difficult to study the global 
minimizer analytically without convexity. As is common in the literature, 
we study the behavior of local minimizers. 

Directly studying the local minimizer of the /9-regularization problem (3) 
is generally difficult. We take the idea of constructing a solution to (3) by 
analyzing the solution of related p-regularized least squares problem (4) 
with regularization parameter AG (0,oo) and then letting A 0+, where 
Px{t) = Xpit),t£[0,oo). 

We introduce some notation to simplify our presentation. For any 5 C {1, 
. . . ,p}, X5 stands for an n x ISj submatrix of X formed by columns with 
indices in S, hs stands for the subvector of b formed by components with 
indices in S and S'^ denotes its complement. For any vector b = (61, . . . , bq)'^ , 
define sgn(b) = (sgn(6i), . . . ,sgn(5g))-^, where the sign function sgn(2;) = 1 
if X > 0, -1 if X < and if x = 0. Let 

(16) p(t) = sgn(t)p'(|t|), teR, 

and p{h) = {p{bi), . . . ,p{bq)Y , b = (61, . . . , 6^)'^. Clearly, for aG (0, 00), 

(17) pa{t)=sgn{t)a{a+l)/{a + \t\f and poo{t) = sgn{t), t G R, 

where pa is defined in (9) and (10). 

The following theorem gives a sufficient condition on the strict local min- 
imizer of (4) for any n-vector y and nxp matrix X. 
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Theorem 1 (Regularized least squares). Assume that px satisfies Con- 
dition 1 and P G with Q = X^ nonsingular, where X G (0, oo) and 
^ ^A ^ ^ 

9JIa = supp(/3 ). Then j3 is a strict local minimizer of (4) with A„ = A if 

(18) PB, = Q^'^BiJ - AnAQ-V(3ltJ' 

(19) ||z^Joo<p'(0+), 

(20) Amin(Q) > A„Ak(p;3|jJ, 

where z = (A„A)~-'^X-^(y — X/3 ), Ainin(") denotes the smallest eigenvalue of 
a given symmetric matrix, and n{p',P^^) is given by (8). 

Conditions (18) and (20) ensure that /3 is a strict local minimizer of (4) 

when constrained on the ||o-dimensional subspace {/9 G R^ : = 0} of 

^\ 

R^. Condition (19) makes sure that the sparse vector P is indeed a strict 
local minimizer of (4) on the whole space R^. When p is convex, (19) and 

(20) can be, respectively, relaxed to no greater than and no less than under 

which P is a minimizer of (4). Due to the possible nonconvexity of p, the 
technical analysis for proving local minimizer needs the strict inequalities in 
(19) and (20). 

When p is taken to be the Li penalty, the objective function in (4) is 
convex. Then the classical convex optimization theory applies to show that 

P — iPiJ ■ ■ ■ iPp) is a global minimizer if and only if there exists a subgra- 

dient z G dLi{P ), such that 

(21) X^X3^ - X^y + A„Az = 0, 

where the subdifferential of the Li penalty is given by dLi{P ) = {z = 
(zi, . . . , Zp)'^ G R^ : Zj = sgn0j) for Pjj^O and zj G [-1, 1] otherwise}. Thus 
provided that Q = XJi X^s, is nonsingular with dJlx = supp(/3 ) , condition 

(21) reduces to (18) and (19) with < and p'{0+) replaced by < and 1, re- 
spectively, whereas by the nonsingularity of Q, condition (20) always holds 

for the Li penalty. However, to ensure that P is the strict minimizer we 
need the strict inequality in (19). These conditions for the Li penalty have 
been extensively studied by many authors, for example, Efron et al. (2004), 
Fuchs (2004), Tropp (2006), Wainwright (2006) and Zhao and Yu (2006), 
among others. 
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By analyzing the solution of (4) characterized by Theorem 1 and letting 
A 0+, we obtain the following theorem providing a sufficient condition 
which ensures that (3q is a local minimizer of (3). 

Theorem 2 (Sparse recovery). Assume that p satisfies Condition 1 with 
k{p) G [0, cxd), Q = X^^jXoto is nonsingular with OJIq = supp(/3o), and X = 
(xi,...,Xp). Then (3q is a local minimizer of (3) if there exists some e G 
(0,minjg9jto |/^o,j|) such that 

(22) maxmax|(x,-,u)| < p'(0+), 

where = {XOToQ^ip(v) : v G VJ and = rijesrKoi* • I* - /^Ojl < e}- 

We make some remarks on Theorem 2. Clearly condition (22) is free of the 
scale of the penalty function, that is, invariant under the rescaling p ^ cp 
for any constant cG (0,oo). Condition (22) is independent of the scale of 
X, and depends on /3q through $IRo and a small neighborhood Ve of /9oOTo' 
It can be viewed as a local condition at (9Jto;/3o OTq)- -^o^ the Li penalty 
Poo, we have /5'oo(0+) = 1 ^i^d for any e G (0, min^gOTo |/5o,j|)) contains a 
single point uq = XgjjoQ"^ sgn(/3Q g^^^). This shows that for the Li penalty, 
condition (22) reduces to the following condition 

(23) max |(xo,uo)| < 1, 

which can actually be relaxed to max^gsjjjc |(xj,Uo)| < 1 while ensuring the 
Li/Lq equivalence. In the context of model selection, condition (23) was 
called the weak irrepresentable condition in Zhao and Yu (2006), who in- 
troduced it for characterizing the selection consistency of lasso. For the pa 
penalty with a G (0, oo), by (11) we have 

/(0+) = l + a^^^oo and p'it) = "'j^^^]} ^0 asa^O+ 

[a + ty 

for each t G (0,oo), which shows that condition (22) is less restrictive for 
smaller a. This justifies the flexibility of the pa penalties. 

A great deal of research has contributed to identifying conditions on X and 
/3o that ensure the Li/Lq equivalence. See, for example, Chen, Donoho and 
Saunders (1999), Donoho and Elad (2003), Donoho (2004) and Candes and 
Tao (2005, 2006). In particular, Donoho (2004) shows that the individual 
equivalence of Li/Lq depends only on 97^0 and /3o«mo- Condition (23) is 
independent of the scale of X, and depends on (3q only through TIq and 
/3ogjto' sharing the same spirit. The idea of using weighted Li penalty in 
the /9-regularization problem (3) has been proposed and studied by Candes, 
Wakin and Boyd (2008). 
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3.4. Optimal pa penalty. Theorem 2 gives one characterization of the 
role of penalty functions in sparse recovery (2). In this section, we identify 
the optimal penalty pa for given X and (3q in sparse recovery. 

For any e E (0, min^gOTo l/^ojl)) we define 

(24) Ve = {All penalties pa in (3) that satisfy condition (22)}. 

By Theorem 2, any pa penalty in Ve ensures that Pq is recoverable in theory 
by the pa-regularization problem (3). We are interested in a penalty Paapt(e) 
that attains the minimal maximum concavity in the sense that 

(25) <Pa,^,{e)) = inf K{Pa)- 

Pa t ' e 

Such penalty Pa^ptie) makes the objective function in (3) have the minimal 
maximum concavity, which is favorable from the computational point of view 
since the degree of concavity is related to the computational difficulty. We 
thus call /Oaopt(e) the optimal penalty. The following theorem characterizes 
it. 



Theorem 3 (Optimal pa penalty for sparse recovery). Assume that Q = 
X^Jj^jX^o is nonsingular with TIq = supp(/9q) and e G (0, min^gOTo l/^o.jD- 
Then the optimal penalty Paopt(e) satisfies: 

(a) aopt(e) G (0,oo] and is the largest a G (0,oo] such that 

(26) max max I (xo, u)| < 1 + a~"^, 

where = {XotqQ-^^(v) : v G VJ and Ve = rijgOToi* • I* ~ /^o,j| < e}- 

(b) aopt(e) =00 if and only if 

(27) max|(xj,uo)| < 1, 
where uq = Xot„Q-^ sgn(/3o,OT(,). 



By the characterization of aopt(e)) we see that for any pa penalty with 
a G (O,aopt(e)), /3o always a local minimizer of (3). However, in view of 
(12), its maximum concavity increases to oo as a — > 0+. Theorem 3(b) makes 
a sensible statement that for sparse recovery, the Li regularization in (3) 
is favorable from the computational point of view if condition (27) holds, 
which, as mentioned before, entails the Li/Lq equivalence. We would like 
to point out that the optimal parameter aopt(e) depends on /3q and thus 
should be learned from the data. We give an example of calculating aopt(e) 
below. 
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Example 1. Assume that Xotq is orthonormal with ^SIq = {1, . . . ,s} 
and |/9o,i| = • • • = |/3o,s|- Thus by (2), 

s s 

(28) y = ^ /?ojXj = |/?o,i| J2 sgn(/3oj)xj. 

Let W be the hnear subspace of R" spanned by xi , . . . , and Ti'^ its orthog- 
onal complement. Further assume that p > s + 1, Xj G 7i-^ for each j > s + 2, 
||xs+i||2 = 1, and 



s 



(29) Xs+i = rs~i/2 ^ sgn(/3o,j)xj + h with r G (-1, 1) and h G 

i=i 

Then for any e G (0, |/3o,i|)i we have 

max max | (x^- , u) | = max | (x^+i , u) | = max | (x^+i , XOTo^a(v)) | 



for any |r| > s ^Z^, we have 
(30) aopt(6)- 



'|Via(a + l)/(a + |/3o,i| -e)^ 

a(a+l) 
*(a+|/3o,i|-e)" 



Thus condition (26) reduces to \r\\/s <l + a ^, which shows that 



We see that the optimal parameter Oopt (e) is related to \Po,i I — e through both 
r and s. It approaches oo as \r\ — > 5"^/^+ and goes to (|/3o,i| — e)/('S"^^^ ~ 1) 
as |r| ^ 1— (see Figure 2). When \r\ < s^^^"^ , we have aopt(e) = oo regardless 
of eG(0,|/3o,i|). 

In light of (28) and (29), r G (—1, 1) defining the noise predictor x^+i is 
exactly the correlation between x^+i and y. Therefore in the noiseless setting 
(2), when the number of true variables s is large, the correlation r between 
the noise variable Xjj+i and the response y has to be small in magnitude in 
order that the Li penalty is the optimal penalty, that is, aopt(e) = oo. Note 
that the cut-off point for \r\ by our theory is s^^/^, while s~^^'^ is exactly 
the absolute correlation between each true variable Xj and response y. 

If we assume |/3o,i| = minjeOTo \Po,j\ instead of \Po,j \ = • • • = \Po,s\, then the 
right-hand side of (30) gives a lower bound on aopt(e), which entails that the 
cut-off point for |r| by our theory can be above s"^/^. This result is sensible 
once we observe that 

I / M/l I J I / M ™aXjgOTo l/^OjL -1/2 

corr(Xs+i,y) < r and max corr(Xi,yl = — , > s ' , 

where y = J2i=i Poj^j- is interesting to observe that the right-hand side 
of (30) is positively related to min^grrjig \ Poj\, which measures the strength 
of the weakest additive component in the response. 
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Fig. 2. Plot of optimal parameter aopt(0+) in (30) with |/3o,i| = 1 against \r\ for s = 2 
(solid), s = lQ (dashed), s = 50 (dash-dot) and s = 500 (dotted). 

4. Model selection. In this section we consider the regularized least 
squares problem (4) for model selection in (1). Difficulties of recovering 
the true underlying sparse model 9Jto include the collinearity among the 
predictors and the computational cost, both of which increase as the dimen- 
sionality grows [see, e.g., Fan and Li (2006) and Fan and Lv (2008)]. For 
example, classical model selection approaches such as best subset selection 
are very demanding in computation and become impractical to implement 
in high dimensions. 

There is a huge literature on model selection. To name a few in addi- 
tion to those work mentioned before, Frank and Friedman (1993) propose 
the bridge regression. Breiman (1995) introduces the nonnegative garrote 
for shrinkage estimation and variable selection. Tibshirani (1996) proposes 
the lasso using the Li-regularized least squares. Oracle properties of non- 
concave penalized likelihood estimators including the SCAD [Fan and Li 
(2001)] have been systematically studied by Fan and Li (2001) and Fan and 
Peng (2004). In particular, Fan and Li (2001) propose a unified algorithm 
LQA for optimizing nonconcave penalized likelihood. Efron et al. (2004) 
introduce the least angle regression for variable selection and present the 
LARS algorithm. Zou and Li (2008) propose one-step sparse estimates for 
nonconcave penalized likelihood models and introduce the LLA algorithm 
for optimizing nonconcave penalized likelihood. Candes and Tao (2007) pro- 
pose the Dantzig selector and prove its nonasymptotic properties. Later, 
Meinshausen, Rocha and Yu (2007), Bickel, Ritov and Tsybakov (2008) and 
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James, Radchenki and Lv (2009) establish the equivalence or approximate 
equivalence of lasso and Dantzig selector under different conditions. More 
recent regularization methods include MCP proposed by Zhang (2007). 

Fan and Li (2001) point out the bias issue of lasso. Zou (2006) proposes 
the adaptive lasso to address this issue by using an adaptively weighted 
Li penalty. Greenshtein and Ritov (2004) study the persistency of lasso- 
type procedures in high-dimensional linear predictor selection. Hunter and 
Li (2005) propose and study MM algorithms for variable selection. Li and 
Liang (2008) study variable selection for semiparametric regression models. 
Wang, Li and Tsai (2007) study the problem of tuning parameter selection 
for the SCAD. Fan and Fan (2008) study the impact of high dimensionality 
on classifications and propose the FAIR. Fan and Lv (2008) propose the 
SIS as well as its extensions for variable screening and study its asymptotic 
properties in ultra-high-dimensional feature space. 

4.1. Regularized least squares estimator. We consider the regularized least 
squares problem (4) with the penalty px in the class satisfying Condition 
1. For a given regularization parameter A„ G [0,oo) indexed by sample size 

n, a p-dimensional vector /3 " is conventionally called a regularized least 
squares estimator of (3q if it is a (local) minimizer of (4). When the L2 
penalty p{t) = t^ is used, the resulting estimator is called the ridge esti- 
mator, and its limit as A„ — > 0-|- can be easily shown to be the ordinary 

least squares estimator (3 = (X^X)^X^y, where (•)"'' denotes the Moore- 

Penrose generalized matrix inverse. f3 is also a solution to the normal 
equation X^y = X-'^X/?. 

When A„ G (0,oo), Theorem 1 in Section 3.3 gives a sufficient condition 
on the strict local minimizer of (4). From the proof of Theorem 1, we see 

that any local minimizer /3 " of (4) must satisfy 

(31) = Q"'X|^/ - A„A„Q-ip(3|,J, 

(32) \\z- ||oo<p'(0+). 



(33) Amm(Q) > A„AnK(/>;/3k ), 

'■An 

where = supp(3^"), Q = Xg^ , z = (A„A„)-iX^(y - x3^"), 

OTa„ -'-"An 

and k{p;(3^ ) is given by (8). So there is generally a slight gap between 
the necessary condition for local minimizer and sufficient condition for strict 
local minimizer, in view of (32), (33) and (19), (20). Hereafter the regularized 

least squares estimator is referred to as a Z-estimator /3 " E that solves 
(31)-(33). 
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We observe that /3^^ in (31) is the difference between two terms. The 
first term Q~^X^ y is exactly the ordinary least squares estimator by 

using predictors xj with indices in In the case of orthonormal design 

matrix X, we have Q = with s„ = "||o, and thus the second term 
becomes h.n\np{(3£> ), which, for nonzero components, has the same sign 
as f3^ componentwise by definition. In view of (31), we have 



3^;„+AnA„/^(3|^J = X| y. 



which entails that both and AnXnp{P<£ ) (for its nonzero compo- 

nents) have the same sign as the ordinary least squares estimator X^ y. 

OTa„ 

This shows that the second term above is indeed a shrinkage towards zero 
when X is orthonormal. For the penalties pa introduced in Section 2.2, Pa{t) 
depends on both t and a [see (17)]. In fact, for small a, pa{t) takes a wide 
range of values when t varies, which ensures that pa penalty shrinks the 
components of the ordinary least squares estimator differently. This gives 
us flexibility in model selection. It provides us a family of regularized least 
squares estimators indexed by parameter a and regularization parameter . 
As a becomes smaller, it generally gives sparser estimates. 

4.2. Nonasymptotic properties. In this paper, we study a nonasymptotic 
property of /9 called the weak oracle property for simplicity, which means: 
(1) sparsity in the sense of (3<j^c = with probability tending to 1 as n ^ oo, 
and (2) consistency under the Lqo loss. This property is weaker than the 
oracle property introduced by Fan and Li (2001). As mentioned before, we 
condition on the n x p design matrix X. 

We use the p\ penalty in the class satisfying Condition 1 and make the 
following assumptions on the deterministic design matrix X and the distri- 
bution of the noise vector e in the linear model (1). 

Condition 2. X satisfies 

(34) ||(XotoXotJ~^||oo < Cin, 

(35) ||X^cX5utQ(X^(,Xrr)rto) "^||oo<C'2n, 



where TIq = supp(/3o), Cin G (0,cx)), C2„ G i^^^-^^^] for some C,co € 
(0, 1), bo = minjggjjo \(3oj\, and || • ||oo denotes the matrix oo-norm. 

Here and below, p is associated with regularization parameter A„ defined 
in (38) unless specified otherwise. 
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Condition 3. e ~ N{0, a'^In) for some cr > 0. 

When Xgjto is orthonormal, the left-hand side of (35) becomes ||X^cXgj[y 
the maximum absolute sum of covariances between a noise variable and all 
s true variables. Condition (35) constrains its growth rate. By the concav- 
ity of p in Condition 1, p'{t) is decreasing in t € [0,oo) and thus the ratio 
p'(0-|-)//?'(co^o) is always no less than one. When the Li penalty is used, 
C2n S [0, C] and Condition 2 reduces to the condition in Zhao and Yu (2006) 
and Wainwright (2006). 

Condition 2 is an assumption on design matrix X. If we work with random 
design, we can calculate the probability that Condition 2 holds by using 
the results from, for example, the random matrix theory. The Gaussian 
assumption in Condition 3 can be relaxed to other light-tailed distributions 
so that we can derive similar exponential probability bounds. 

Condition 4. There exists some 7 e (0, i] such that 



(36) 



+ 7(UTy^'" 



Ci„ = 0(n-T), 



where = max^ggjijlxj II2, -D2n = maxjggjfc ||xj II2 and X = (xi, . . . ,Xp). 
Let Un G (0, 00) satisfy lim„^oo Un = 00, A„ < A„,, and 

(37) Un < [Ko{C2nDin + D2n)r'Xnnni^mo^mo)i^ " C)p'(0+)a"S 

where 

^QQ\ X _ A-l{C2nDin + D2n)UnCr ^ _ CfJ (1 - Co)6o " '"■n-Di„CT 

A„ — A — — — — ana A„, — — - — - , 

p'{0+) - C2np'{cobo) A„p'(co6o; An) 

C, Co E (0, 1) are given in Condition 2, and kq = max{/t(p; b) : ||b — jSo^s^g ||oo !^ 
(1 — co)6o} with K,{p;h) given by (8). 

As seen later, we need the condition A„ < A„ to ensure the existence of 
a desired regularization parameter A„ = A„. Condition (37), which always 
holds when kq = 0, is needed to ensure condition (33). 

Theorem 4 (Weak oracle property). Assume that p\ in (4) satisfies 
Condition 1, Conditions 2-4 hold and p = o{une^"^^) . Then there exists a 
regularized least squares estimator j3 with regularization parameter — 
A„ defined in (38) such that with probability at least 1 — -^pu~"^e~""/^, /3 
satisfies: 

(a) (Sparsity) = 0' 
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(b) (Loo loss) 0mo- f^o,<}no\\°o<h = 0{n ^n„,), 

where TIq = supp(/3o) and h = [Din + ^yj^D2n]CinUn{l - C)~V. As a 
consequence, \\(3 " — (3q\\2 = Op{^/sn~'^Un), where s = \\Pq\\o. 

We make some remarks on Theorem 4. In view of lim„^oo Un = oo in 
Condition 4, the dimensionahty p is ahowed to grow up exponentially fast 
with Un- If in (37) is of a small order, u„ can be allowed to be o{n"') 
and thus logp = o(n^'^). The diverging sequence (u^) also controls the rate 
of the exponential probability bound. From Theorem 4, we see that with 

asymptotic probability one the Lqo estimation loss of (3 " is bounded above 
by /ii + /i2 , where 

(39) hi = DinCinUn{l-Cr^<J and = 4^I)2nCi„n„(l - C)"V. 

The second term /i2 is associated with the penalty function p. For the Li 
penalty, the ratio p' (cobo) / p' (0+) is equal to one, and for other concave 
penalties, as mentioned before, this ratio can be (much) smaller than one. 
This is consistent with the fact shown by Fan and Li (2001) that concave 
penalties can reduce the biases of estimates. 

In the classical setting of Din,D2n = 0{^/n) and Ci„ = 0(n~^), we have 
7 = 1/2 since p' (cQbo) / p' (0+) < 1 by Condition 1 and thus the consistency 

rate of /3 " under the L2 norm becomes Op{^/sn~^^'^Un), which is slightly 
slower than Op{^/sn~^^'^). This is because it is derived by using the Loo loss 

of /3 " in Theorem 4(b). The use of the Loo norm is due to the technical 
difficulty of proving the existence of a solution to the nonlinear equation (31). 
We conjecture that by considering the L2 norm directly, one can obtain the 
consistency rate Op{y/sn~^^'^). 

4.3. Choice of pa penalty. Theorem 4 gives one characterization of the 
role of penalty functions in regularized least squares for model selection in 
(1). Let us now consider the penalties pa introduced in Section 2.2. We fix the 
diverging sequence (u„) in Theorem 4 and see how the parameter a G (0, 00] 
influences the performance of the pa-regularized least squares method. 

In view of (11), we have 

(^0) / / u \ = 2 ' a e {0,00) and ^—— = 1. 

p'aicobo) P'oo(cofeo) 

Clearly p'a{0+) / p'ai'^obo) is decreasing in a G (0,oo]. Thus (35) in Condition 
2 becomes less restrictive as a — > 0+. In view of (39) by Theorem 4, the 
upper bound on the Loo estimation loss of the /9a-regularized least squares 
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estimator f3 " decreases to hi as a approaches 0. However, in view of (12), 
the maximum concavity of pa increases to oo as a — > 0+. This suggests 
that the computational difficulty of solving the p^-regularized least squares 
problem (4) may increase as a approaches 0. In practical implementation, 
we can adaptively choose a using the data, for example, the cross-validation 
method. 

5. Implementation. In this section, we discuss algorithms for solving 
regularization problems with pa penalty. Specifically, the pa-regularization 
problem (3) for sparse recovery in (2), and the -regularized least squares 
problem (4) for model selection in (1). 

5.1. SIRS algorithm for sparse recovery. For any (3 = {[ii , (3p)^ G R^, 
let D(/3) = diagjdi, . . . ,dp\ and 

(41) v(/3)=DX^(XDX^)+y, 

where D = D(/3) and = 13] / pa{\f3j\) = {a + l)"i|/3j|(a + |/?,-|), j = 1, . . . ,p. 
We propose the sequentially and iteratively reweighted squares (SIRS) al- 
gorithm for solving (3) with pa penalty. SIRS uses the method of iteratively 
reweighted squares and iteratively solves the /3-regularization problem (3) 
with the weighted L2 penalty p(/3) = P'^TfB, where T = diagjd^^, . . . , d"^} 
with D = 0(7) for some 7, = 00 and • 00 = 0. It sequentially searches 
for a good initial value that leads to /3o- Pick a level of sparsity S, the 
number of iterations L, the number of sequential steps M < S and a small 
constant e G (0, 1). 

SIRS ALGORITHM. 

1. Set k = 0. 

2. Initialize /S^^^ = 1 and set £ = l. 

3. Set /3(^) ^ v(/3(^-^)) with D = D(/3(^"^)) and £ ^ £ + 1. 

4. Repeat step 3 until convergence ot i = L + 1. Denote by /3 the resulting 
p- vector. 

5. If WPWo < S, stop and return /3. Otherwise, set /c <— /c + 1 and repeat 
steps 2-4 with ^3^°) = /(|3| > jk) + eI{\P\ < 7fc)^and jk the kth largest 
component of |/3|, until stop or k = M. Return (3. 

In practice, we can set a small tolerance level for convergence and apply 
a hard thresholding with a sufficiently small threshold to /3 to generate 
sparsity in step 4. The small constant e G (0, 1) is introduced to leverage the 
scoring of variables with indices in TIq = supp(/3o) by suppressing the noise 
variables. Through numerical implementations, we have found that SIRS is 
robust to the choice of e. SIRS algorithm stops once it finds a sufficiently 
sparse solution to (2). We use a grid search method to select the optimal 
parameter a of penalty pa that produces the sparsest solution. 
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5.1.1. Justification. It is nontrivial to derive the convergence of SIRS 
algorithm analytically. In all our numerical implementations, we have found 
that it always converges. In this section, we give a partial justification of 
SIRS. 

For each given step 3 of SIRS solves the /9-regularization prob- 

lem (3) with the weighted L2 penahy p{f3) = f3^T(3, where T is given by 
D = D(/3(^-i)). This can be easily shown by Proposition 1 and the iden- 
tity v(/3) = Di/2(DV2x'^XDi/2)+j)i/2xTy_ follows from the definitions 

of D and r that Pa(/9^^~^^) = (/3(^-^))^r/3(^^i) = p(/3(^^i)). Thus the two 
penalties agree at /3 = Quadratic approximations have been used in 

many iterative algorithms such as LQA in Fan and Li (2001). 

The following proposition characterizes lim^_»oo P^^^ when it exists. 

Proposition 2. (a) If lim£^^ P^^^ exists, then it is a fixed point of the 
functional T A, 

(42) T(-i) = argmin^^r(7)/3, 

l3eA 

where A = {P £ HP : y = X/3} and T{-f) denotes T given by T) = 0(7) . 

(b) Pq is always a fixed point of the functional T . 

(c) Assume that p> n, spark(X) =n + l and \\Pq\\o < (ri-|- 1)/2. Then for 
any fixed point P of the functional T , we have P = Pq or ||/3||o>(n-|-l)/2. 

5.1.2. Computational complexity. We now discuss the computational com- 
plexity of the SIRS algorithm. We first consider the case oi p>n. Note that 
DX^(XDX^)+y = limA^o+DX'^(A/n + XDX^)-V- Thus in step 3, we 
can avoid calculating the generalized matrix inverse and only need to cal- 
culate the p-vector DX-^(A/„ -|- XDX^)^^y for a fixed sufficiently small 
A € (0,00). Since the p x p matrix D is diagonal, the computational com- 
plexity of XIn + XDX^ is 0{n'^p). It is known that inverting an n x n 
matrix is of computational complexity O(n^). This shows that the com- 
putational complexity of {XIn + XDX"'")"^ is 0{n'^p) since p > n. Then 
it is easy to see that step 3 has computational complexity 0{n'^p). Simi- 
larly, when p < n we can derive that step 3 has computational complex- 
ity 0(V) by using the identity v(/3) = DV2(Di/2x^XDV2)+Di/2x^y, 
which equals limx^o+'D^^^{>^Ip + 'D^^'^^^X.'D^^^y^'D^^^X.^y. Therefore, 
the computational complexity of the main step, step 3, of the SIRS algorithm 
is 0{np{n Ap)), which is the same as that of the LARS algorithm [Efron et 
al. (2004)] for model selection. For given number of iterations L and number 
of sequential steps M, SIRS has computational complexity 0{np{nAp)LM). 
We would like to point out that SIRS stops at any sequential step once it 
finds a sufficiently sparse solution to (2). 
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5.2. Model selection. Efficient algoritlims for solving the regularized least 
squares problem (4) include the LQA proposed by Fan and Li (2001) and 
LLA introduced by Zou and Li (2008). In this paper we use LLA to solve (4) 
with Pa penalty. For a fixed regularization parameter An, LLA iteratively 
solves (4) by using local linear approximations of J2'j=i Pa{\Pj\)- At a given 
^(0) ^ (^(0)^ _ _ . LLA approximates E^iPailPjl) as 

EMI/3f |) + p'a(l/3f l)(l/?,|-|/3f I)]. 
i=i 

Then LLA solves the weighted lasso (Li-regularized least squares) 
(43) min 1 2-1 ||y - X/3||i + A„A„ wj\(3j \ | , 

where wj = p'a{\pf\) = a{a + l)/(a + j = l,... ,p. We use a grid 

search method to tune the parameter a of penalty pa- 

6. Numerical examples. 

6.1. Simulation 1. In this example, we demonstrate the performance 
of Pa penalty in sparse recovery. We simulated 100 data sets from (2) with 
(s,n,p) = (7,35,1000), 9Jto = {1, 2, . . . , 7} and /3o,im„ = (1, -0.5, 0.7, -1.2, 
—0.9,0.3,0.55)^, for each of three levels of correlation r. Let F^ be a p x p 
matrix with diagonal elements being 1 and off-diagonal elements being r. 
We chose r = 0, 0.2 and 0.5. The rows of X were first sampled as i.i.d. copies 
from A^(0,Fr.) and then each of its columns was rescaled to have unit L2 
norm. 

As discussed in Section 5.1, we implemented pa regularization (3) using 
SIRS algorithm. We set 5 = [^] , e = and M = S. The parameter a was 
chosen to be in {0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 1, 2, 5}. For a comparison, we also 
implemented the Li regularization (3), which can easily be recast as a linear 
program. The optimal SICA refers to the method that tunes parameter a of 
Pa penalty by selecting the one that generates the sparsest solution. Since 
sparse recovery can be formulated as model selection, we also implemented 
Pa-regularized least squares (4) using LLA algorithm with parameters A and 
a tuned by cross-validation, and compared it with SIRS. 

Table 1 shows the comparison results. We see that pa with finite a and op- 
timal SICA significantly outperformed Li. As a gets larger, the performance 
of Pa approaches that of Li. When a approaches zero, the success percent- 
age first increases and then decreases. This suggests that the computational 
difficulty increases for very small a. 
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6.2. Simulation 2. In this example as well as the next two ones, we 
demonstrate the performance of pa penalty in model selection. The data 
were generated from model (1). We set = (100,50) and chose the true 

regression coefficients vector /3q by setting TIq = {1, 2, . . . , 7} and /3o,ano ~ 
(1,-0.5,0.7,-1.2,-0.9,0.3,0.55)^. The number of simulations was 100. For 
each simulated data set, the rows of X were sampled as i.i.d. copies from 
A^(0,So) with So = (0.5I*-Jl)ij=i,...,p, and e was generated independently 
from N{0,a^ln). Two noise levels a = 0.3 and 0.5 were considered. We com- 
pared SICA with lasso, SCAD and MCP. Lasso was implement by LARS 
algorithm, and SCAD, MCP and SICA were implemented by LLA algo- 
rithm. The regularization parameters A and a were selected by using a grid 
search method based on BIC, following Wang, Li and Tsai (2007). 

Three performance measures were employed to compare the four methods. 
The first measure is the prediction error (PE) defined as E[y — ^ j3)'^ , where 
(3 is the estimated coefficients vector by a method and x is an independent 
test point. The second measure, ^S, is the number of selected variables in 
the final model by a method in a simulation. The third one, FN, measures 
the number of missed true variables by a method in a simulation. 

In the calculation of PE, an independent test sample of size 10,000 was 
generated. All four methods had median FN = 0. Table 2 and Figure 3 
summarize the comparison results given by PE and ^^S. 

Table 1 

Success percentages of Li, pa with different a and optimal SICA (with a tuned) in 
recovering fi^ under three levels of correlation r in Simulation 1, where 
{s,n,p) = {7, 35,1000) 







SIRS 






LLA 




Methods 


r = 


r = 0.2 


r = 0.5 


r = 


r = 0.2 


r = 0.5 


Li 


11% 


8% 


4% 








P5 


19% 


12% 


8% 


28% 


26% 


20% 


P2 


50% 


37% 


32% 


53% 


40% 


34% 


Pi 


63% 


58% 


51% 


57% 


46% 


43% 


PO.6 


70% 


64% 


57% 


54% 


43% 


42% 
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74% 


69% 


62% 


51% 


38% 


38% 


PO.3 
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58% 


49% 
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Fig. 3. Boxplots of PE and #5 over 100 simulations for all methods in Simulation 2, 
where p = 50 and the rows of X are i.i.d. copies from N{0,T,o). The x-axis represents 
different methods. Top panel is for PE and bottom panel is for #S. 



6.3. Simulation 3. The setting of this example is the same as that of 
Simulation 2, except that {n,p) = (100, 600) and cr = 0.1, 0.3. Since p is larger 
than n, BIC breaks down in the tuning of A and a. Thus we used five- fold 
cross-validation based on prediction error to select the tuning parameters. 
All four methods had median FN = 0. Table 3 and Figure 4 summarize the 
comparison results given by PE and t^S. The boxplots of lasso are truncated 
to make it easier to view. 

6.4. Real data analysis. In this example, we apply SICA to the diabetes 
dataset, which was studied by Efron et al. (2004). This dataset contains 



Table 2 

Medians of PE and #S over 100 simulations for all methods in Simulation 2, where 
p = 50 and the rows ofK are i.i.d. copies from N(0,'Eo) 





Measures 


Lasso 


SCAD 


MCP 


SICA 


a = 0.3 


PE (xlQ-^) 


12.88 


9.77 


9.38 


9.97 




#s 


13 


7 


7 


7 


a = 0.5 


PE (xlQ-i) 


3.70 


2.87 


2.84 


2.80 




#s 


13 


7 


7 


7 
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10 baseline variables: age (age), sex (sex), body mass index (bmi), average 
blood pressure (bp) and 6 blood serum measurements (tc, Idl, hdl, tch, 
Itg, glu) for n = 442 diabetes patients, as well as the response variable, 
a quantitative measure of disease progression one year after baseline. We 
implemented lasso, SCAD, MCP and SICA. Five-fold cross-validation was 
used to select the tuning parameters. All four methods excluded variable 
hdl in their final models. Lasso selected the remaining 9 variables, whereas 
SCAD, MCP and SICA all selected the same 6 variables. The estimated 
coefficients by different methods are shown in Table 4. For a comparison, we 
also included the adjusted and average prediction error in five-fold cross- 
validation for each method in Table 4. We see that SCAD, MCP and SICA 
performed similarly on this real dataset, while lasso produced a different 
model. 

7. Proofs. 

7.1. Proof of Proposition 1. Let X = UDV be a singular value decom- 
position of X, where U and V are, respectively, nx n and px p orthogonal 



Table 3 

Medians of PE and #S over 100 simulations for all methods in Simulation 3, where 
p = 600 and the rows ofX. are i.i.d. copies from A'^(0,So) 





Measures 


Lasso 


SCAD 


MCP 


SICA 


cr = 0.1 


PE (xlO^^) 


2.51 


1.07 


1.07 


1.10 




#s 


71.5 


7 


7 


7 


(T = 0.3 


PE (xlO"^) 


21.15 


9.92 


9.89 


9.90 




#s 


69.5 


16 


10 


8 



Table 4 

Model coefficients obtained by all methods on the diabetes dataset, and their adjusted E? 
[-R^(adj)] and average prediction errors (APE) based on five-fold cross-validation 



Methods 


age 


sex 


bmi 


bp 


tc 


Idl 


Lasso 


-6.4 


-235.9 


521.8 


321.0 


-568.6 


301.6 


SCAD 





-226.2 


529.9 


327.1 


-757.6 


538.3 


MCP 





-226.3 


529.9 


327.2 


-757.7 


538.4 


SICA 





-219.5 


531.7 


323.3 


-743.1 


525.0 




hdl 


tch 


Itg 


glu 


i?2(adj) 


APE 


Lasso 





143.9 


669.6 


66.8 


50.73% 


2956.9 


SCAD 








804.1 





50.82% 


2939.5 


MCP 








804.1 





50.82% 


2939.1 


SICA 








800.2 





50.82% 


2935.8 



24 



J. LV AND Y. FAN 



= 0.1 



:0.3 



0.03 
0.025 



^ 0.02 



0.015 
0.01 



I 
I 

J. 



Lasso SCAD MCP SICA 
c = 0.1 



80 
60 



^ 40 



20 



I 
I 

J. 



1 ^ 
Lasso SCAD MCP SICA 




Lasso SCAD MCP SICA 
= 0.3 



80 
60 



^ 40 



20 



I 
I 

J. 



t 



Lasso SCAD MCP SICA 



Fig. 4. Boxplots of PE and #5 over 100 simulations for all methods in Simulation 3, 
where p = 600 and the rows of X. are i.i.d. copies from N{0,'Eo)- The x-axis represents 
different methods. Top panel is for PE and bottom panel is for #S. 



matrices, D is an n x p matrix with its first k diagonal elements being 
di, . . . ,dk ^ and all other elements being zero, and k = rank(X) <nAp. 
Then we have 

13 £ A ^ y = X/3 ^ U^y = = DV^ = 

(44) 

<^=^ Pi = d^ Wi, i = l,...,k, 

where 13 = . . . , f3p)'^ = V/3 and U"^y = {wi, . . . , Wn)^ ■ Since V is orthog- 
onal, II/3II2 = II/3II2 always holds. Thus it follows from (44) that 

/32=argmin||/3||2=V^^o, 

where Pq = {d^^wi, . . . , d^^ifj^, 0, . . . , 0)^. It remains to show V-^/Jg = (X'^ x 
X)+X^y . By the above singular value decomposition of X, we have (X-^X)"*" = 
V"^ diag(a) V, where a = (d^"^, . . . , d'^'^,0, . . . , 0)"^. Therefore, it is immediate 
to see that 

diag(a)D^U^y = (d^^wi, d]:'wk,0, ...,Of = Po 

and thus 

(X'^X)+X^y = V'^ diag(a) VV^D^U^y = V^^q- 
This completes the proof. 
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7.2. Proof of Theorem 1. Fix an ar bitrary 3 = . . . , G R*' and 
let OJIa = supp(/3 ). It follows from the classical optimization theory by tak- 
ing differentiation that if /3 is a local minimizer of the regularized least 
squares problem (4) with A„, = A, there exists some ^ = {vi, . . . ,Vp)'^ G 
such that 

(45) X^X3^ - X'^y + A„Av = 0, 

where for j gMx, vj = p{M) and for j vj G [-p'(0+), p'(0+)]. More- 

^ A ^ A 

over, since /3 is also a local minimizer of (4) constrained on the ||/3 ||o- 
dimensional subspace {(3 £ W : = 0} of R^, it is easy to show that 

(46) Ai„in(Q)>A„AK(p;3|jJ, 

where Q = X^ X.^ and k{p; ) is given by (8). We will see below that 

slightly strengthening the necessary condition (45) and (46) provides a suf- 
ficient condition on the strict local minimizer of (4). 

Since Q = X|^^X^^ is nonsingular, 3|jc = 0, v^^ = p0^J and ||vj^, ||oo < 

p'{0+), (45) can, equivalently, be rewritten as 

(47) = Q^'x|^^y - AnXQ^^p0k), 

(48) llz^Joo < P (0+), 

where z = (A„A)^"^X"^(y — X/3 ). Now we strengthen inequality (48) to strict 
inequality (19) and make an additional assumption (20). We will show that 

(18) , (19) and (20) imply that /3 is a strict local minimizer of (4). 

We first constrain the regularized least squares problem (4) on the ||/3 ||o- 
dimensional subspace i3 = {/3 G R^ : = 0} of R^. It follows easily from 
condition (20), the continuity of p'{t) in Condition 1, and the definition of 

k{p;P^J in (8) that the objective function in (4), £{(3) = 2~^\\y - X/3||i + 
^nJ2^=iPx{\Pj\), is strictly convex in a ball A/q in the subspace B centered 

at (3 . This along with (18) immediately entails that P , as a critical point 
of £(•) in B, is the unique minimizer of £{■) in the neighborhood A/q. Thus 

we have shown that /3 is a strict local minimizer of £{■) in the subspace B. 

It remains to prove that the sparse vector /3 is indeed a strict local 
minimizer of ^(•) on the whole space R^. To show this, we will use condition 

(19) . Take a sufficiently small ball A/i in R^ centered at /3 such that A/i H 
B C A/q. Fix an arbitrary 7^ G A/i \ A/q, we wih show that ^(71) > £(P^)- 
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Let 72 the projection of 7^ onto the subspace B. Then it follows from 
AAi n B C A/o and the definitions of i3, A/q and A/i that 72 G A/q n M, which 

entails that ^(72) > ^{(3 ) if 72 7^ /9 by the strict convexity of £{■) in the 

neighborhood A/q. We see that to prove ^(71) > i{(3 ), it suffices to show 
that£(7i)>£(72). 

It follows from the concavity of p in Condition 1 that p'{t) is decreasing 
in t S [0,00). By condition (19) and the continuity of p'{t) in Condition 1, 
appropriately shrinking the radius of the ball A/i gives that there exists some 
6 £ (0,00) such that p'{5) £ (0,p'(0+)] and for any /3 G A/i, 

(49) llwgjjoo <p'(<5), 

where w = (A„A)~^X^(y — X/3). We further shrink the radius of the ball 
A/i to less than 5. By the mean-value theorem, we have 

(50) ^(7i)=^(72)+V^^(7o)(7i-72), 

where 79 lies on the line segment joining 72 and 7^^ and 79 7^72- Since 
7ii72 G A/i and A/i is a ball centered at /3 with radius less than 6, we 
have 7o = (70,15 • • • )7o,p)"^ G A/i and |7oj| < S for any j G Note that by 
7i G A/i \ TVo, we have 

(7i - 72)gj^ = and (7^ - 73)^, / 0. 

Let S = supp(7;^) \ dJl\ / and 7^^ = (71^1, . . . , 7i^p)-^. It is easy to see 
that sgn(7oj) = sgn(7ij) for any j G Since 79 G A/i, by (49), (50) and 
||7o,OTc lloo < S, we have 

^(7i) - ^(72) = E ^S^7. = [XlX7o - X^y + A.Ap(7o,5)]^7i,5 

jes 

= -A„A[(A„A)-iX^(y - X7o)]^7i,5 
+ A„A E sgn(7o,j)p'(|70j|)7i,i 

> -A„Ap'(5)||7i,5lli + AnAE/''(l7oj|)l7i,jl 

i6S 

>-A„V(5)||7i,5lli+AnAE/''(<^)l7Ml=0, 

iGS 

where we used the fact that p'(|7o,j|) > p'('^) since p'{t) is decreasing in 
t G [0, 00) and |7o,j| < 5 for any j G 5. This shows that ^(71) > ^(72)) which 
concludes the proof. 
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7.3. Proof of Theorem 2. By (2), we have y = X/3o. As mentioned be- 
fore, to study the p-regularization problem (3), we consider the related p- 
regularized least squares problem (4) with y = X/^g , A„ = 1 and (t) = 
Xp{t), t G [0,oo). We will construct a sequence of strict local minimizers 
= {/3i, . . . , Pp)"^ G R'' of (4) for a sequence of A G (0, oo) by using Theo- 
rem 1 and show that limA^o+ P — Po- Moreover, with some careful analysis 
we show that the limit /3q is indeed a local minimizer of (3). 

Fix an arbitrary A G (0, oo) and f3'^ = (/3^, . . . , (3p)'^ G R^. By Theorem 1, 

(3 will be a strict local minimizer of (4) as long as it satisfies conditions (18)- 

(20). We prove the existence of such a solution (5 when QJTa = supp(/3 ) = 

OTTq. Since y = X/3q and k{p;P^^) < k{p) in view of (7) and (8), we can 
rewrite (18) and (19) and strengthen (20) as 

(51) 3srrto = Po,mo - AQ^^p(3oto)' 

(52) llzOT-lloo </o'(0+), 

(53) Amin(Q) > Ak(p), 

where z = X-^Xf)3j,)Q~^p(/3rr^p) and Q = X.^^^X.^q ■ Since Q is nonsingu- 
lar and k{p) < cxd by assumption, all A G (0,Ao) with Aq = Xmm{Q) / k{p) 
automatically satisfy condition (53). We now consider A G (0,Ao). Assume 
that there exists some e G {0,mmj^<)}io \Po,j\) such that (22) holds. We will 

show that there exists a solution (3rgi, G Ve to (51) and (52), where Ve = 
rijeOToi*- 1^ ~ — Note that condition (22) guarantees (52). So it re- 
mains to prove the existence of (3^^^ G Ve to (51). 

Note that p'{t) is decreasing in t G [0,oo) and thus ||p(/9ot(,)||oo < p'(0+)- 
Let s = \\Po\\o, 

h= max ||Q~"'"v||oo 
veRM|v||oo<p'(o+) 

and Ai = Aq V (e//i). We now consider AG (0, Ai). Then for any 7 G Ve, we 
have 

||AQ^^p(7)||oo < Ai||Q-ip(7)||oo < Xih < e. 
Thus by the continuity of the vector- valued function ^^(7) = 7 — Po^<mo + 
AQ~^p(7), an application of Miranda's existence theorem [see, e.g., Vrahatis 

(1989)] shows that (51) indeed has a solution Ptjji^ in Ve- 

For any A G (0,Ai), we have shown that (4) has a strict local minimizer 

f3 such that supp(/3 ) =9Jlo, (B^yi^ G Ve, and (51) holds. Note that by (51), 
||3^ - /3olloo = 0m, - (3o,mo\\oo = A||Q~V(3^^o)ll°o < A/i, 
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which entails that HmA^o+ P exists and equals /3q. It remains to prove that 
the limit is indeed a local minimizer of (3). 

In view of the choice of A and condition (22), it follows easily from (52), 

(53) and the proof of Theorem 1 that for each A G (0, Ai), /3 is the strict 
minimizer of ^(/9) = 2~^||y — X/3||2 + Ap(/9) on some common neighborhood 
C = {/3 G : /Sfgjj, G Ve, ||/3grj{c ||oo < of P^ q, for some 5 G (0, oo) indepen- 
dent of A. We will show that /3o is the minimizer of (3) on the neighborhood 
= C n ^ of in the subspace A = {f3 ^ : y = X/3}. Fix an arbitrary 

7 G M. Since 7 G ^ and 7,/3 G C for each A G (0, Ai), it follows from (51), 
y = X/3o and Q = X^Jj^^Xoto that for each A G (0, Ai), 

p(7) = A-i£(7) > = P0') + (2A)-i ||y - Xp^g 

(54) 

= p(3')+2-iAp-(3^J^Q-ip(3^J. 

Note that by ||p(3oto)IU < p'(0+), we have 

< pAfQ~'p0mo) < Amin(Q)-^||p(3ito)||i 
< A„,in(Q)-isp'(0+)2. 

Thus by limA^o+ P = /^o ^^'^ the continuity of p, letting A — > 0+ in (54) 
yields ^(7) > p{(3q). This completes the proof. 

7.4. Proof of Theorem 3. By (25) and (12), the optimal penalty Pa^ptie) 
satisfies 

'^(/5aopt(e)) = inf K(/3a)= inf 2(0"^ + o'^) 

(55) 

= sup{a G (0, oo] : Pa G Ve}- 

Thus it follows from the definition of Ve in (24) that the optimal parameter 
Oopt(e) is the largest a G (0,oo] such that the following condition holds: 

where = {XOToQ"^Pa(v) : v G Ve} and Ve = IljeOToi^H* - Po,j \ < e}- It is 
easy to see that for any e G (0, miujgfjjto l/^OjDi we have aopt(e) > since 
1 + — > oo and for each t 0, Pa{i) = t^'^0{a) as a — > 0+. This proves 
part (a). 

Part (b) follows from two simple facts. First, it is clear that aopt(e) = 
if and only if 

(56) maxmax|(x,-,u)| < 1, 
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where = {Xot^Q"^ sgn(v) : v G V^} and = rijgOToi* • I* ~ /^Ojl < e}- Sec- 
ond, for any e G (0, minjggj^Q |/3oj|), contains a single point Uq = 

XOToQ"^sgn(/3o,OTo)- 

7.5. Proof of Theorem 4- We will prove that under the given regular- 
ity conditions, there exists a solution (3 " ^ to (31)-(33) with 9JtA„ = 
supp(/9 ") =9Jto. Consider events 

£^1 = {llX^o^^lloo < ^in-Dinfj} and S2 = {\\y^^ce\\oo<UnD2nCr}, 

where Din and D2n are defined in Condition 4. Let ^ = (^i, . . . = X-^e. 
It follows from the definitions of Di„ and that 

^1 = {l^jl < ti„||xj||2cr:j G 9Ko} C 8i 

and 

^2 = {ICil < UnUjh'y-j G C £2, 

where X = (xi, . . . ,Xp). Since e ~ N{0,a'^ln) by Condition 3, we see that for 
each j = 1,. . . ,p, S^j has a iV(0, HxjlHo"^) distribution. Thus an application 
of the classical standard Gaussian tail probability bound and Bonferroni's 
inequality gives 

P{£i n £2) > P{ri n > 1 - ) + 

(57) > 1 - [2sP{V > Un) + 2{p - s)P{V > Un)] 

\/tt 



where s = \\(3q\\o and V is a standard Gaussian random variable. Hereafter 
we condition on the event £1(1 £2- Under this event, we will show the ex- 
istence of a solution /3 " G R^ to (31)-(33) with sgn(/3 ") = sgn(/3o) and 

||3^"-/3olloo<(l-co)6o. 

By Condition 4, we have A„ < An, where 

(58) A - A"^ (^2n-Dln + D2n)UnCr , j _ C^^{1 - Co)bo - UnPlnCT 

Let An be in the interval [An, An]. Since y = X/3o + e by (1), (31) and (32) 
with = 5Jlo becomes 

(59) 3oto = /3o,OTo - V, 

(60) ||z||oo < /o'(0+), 
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where v = (X^^,XOTo)"MAnA„^(3j^„) - X^^^e] and 

- ( An A„ ) ^ [X^c Xgjio (^OTo ^9^0 ) ^ ^OTo ^ ~ "^WtEj ^] • 

We first prove that (59) has a solution in = {7 G : ||7 — /3g g^j^j ||oo < 
(1 — Co) 60}. Fix an arbitrary 7 G AA. By the concavity of p in Condition 1, 
p'{t) is decreasing in t G [0, 00) and thus ||p(7)||oo < p'(co^o)- This along with 
||X^^e||oo < UnDinCr, (34) in Condition 2 and A„ < A„ yields 

||(X^^,XaKo)~'[A„A„p(7) - X^„£]||oo < (1 - co)6o, 

since />'(t; A) is increasing in A G (0, 00) by Condition 1. Thus by the continu- 
ity of the vector- valued function ^'(7) = 7 — /3o,OTo + (-^ono-^^^o)"^ [A„,A„;o(7) - 
X^^gr], an application of Miranda's existence theorem [see, e.g., Vrahatis 

(1989)] shows that (59) indeed has a solution (3^^^ in M. It remains to check 

the inequality (60) for Ptjjig G Af. In fact it can be easily shown to hold by 
(35) in Condition 2 and ||X^ce||oo < UnD2nO', for A„, = A^. 

So far we have shown the existence of a solution /3 " with A„ = A„ to (31) 
and (32) with sgn(/3 ") = sgn(/3o) and \\(3 " — /3olloo < (1 ~ co)6o under the 
event £in£2. By (58), (59) and C2„ < C^r^, letting A„ = A„ gives 



I|3oto ~/^0,OToIIoo - IK^OTo^OTo) MAnAnP(3fmo) -X^Jj^EjUoo 



+ 7(^ 



Ci„(l-C)^V. 



Note that condition (33) with A„ = A„ is guaranteed by (37) in Condition 
4 since C2n < ^ p^(iofe"o) ' These along with sgn(/3 ") = sgn(/3o), (36) in Con- 
dition 4 and (57) prove parts (a) and (b). Note that 1 — -^pu~'^e~^'^/'^ 1 

since p = o(une""/^). Thus ||/3 " — /Solb = Op(y^n~'^ii„) follows from parts 
(a) and (b) and 

II^OTo ~ /^o,9:noll2 - V^I|3ot() ~ /3o,OTolloo- 
This concludes the proof. 

7.6. Proof of Proposition 2. Part (a) follows directly from the definition 
of /fl'-^^ = v(/3'-^^^^). It is not hard to check that (3q is the unique minimizer 
of the weighted L2-regularization problem 

min/3^r(/3n)/3, 
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which entails immediately part (b). It remains to prove part (c). Clearly, it 
suffices to show that for any (3 & A, ||/3||o < (n + l)/2 implies P = Pq. We 
prove this by a contradiction argument. Suppose there exists some f3 £ A 
with ||/3||o < (n + l)/2 and/3//3o. Let 7 = /3-/3o. Then we have 7 / and 
X7 = X/3 - X/3o = 0. But 

hllo < mio + Polio <{n + l)/2 + (n + l)/2 = n + 1 = spark(X), 

which contradicts the definition of spark. 

8. Discussion. We have studied the properties of regularization meth- 
ods in model selection and sparse recovery under the unified framework of 
regularized least squares with concave penalties. We have provided regular- 
ity conditions under which the regularized least squares estimator enjoys 
a nonasymptotic weak oracle property for model selection, where the di- 
mensionality can be of exponential order. Our results generalize those of 
Fan and Li (2001) and Fan and Peng (2004) in the setting of regularized 
least squares. For sparse recovery, we have generalized a sufficient condition 
identified for the Li penalty to concave penalties, which ensures the p/Lq 
equivalence. In particular, a family of penalties that give a smooth homo- 
topy between Lq and Li penalties have been considered for both problems. 
Numerical studies further endorse our theoretical results and the advantage 
of our new methods for model selection and sparse recovery. 

It would be interesting to extend the results to regularization methods 
for the generalized linear models (GLMs) and more general models and loss 
functions. These problems are beyond the scope of this paper and will be 
interesting topics for future research. 
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